Abstract. In this paper, we compare several feature selection methods for
the Naive Bayes Classifier (NBC) when the data under study are described
by a large number of redundant binary indicators. Wrapper approaches
guided by the NBC estimation of the classification error probability
outperform filter approaches while retaining a reasonable computational cost.


1 Introduction 

We consider in this paper application contexts in which a large body of expert 
knowledge is available as a series of simple and low level parametric scores. 
The goal is to build an interpretable classifier from this knowledge (and from a 
learning set). The scores are assumed to be simple parametric functions
from the data space to $\mathbb{R}$, with the interpretation that a high value
of a score indicates that the datum submitted to the score probably belongs to
the class that the score has been designed to detect.

Let us consider a concrete example from our main application domain, aircraft
engine monitoring (see [7] for details). We aim here at classifying short time
series (around 150 time points, each series having its own specific length) into
different classes (normal signal and different types of anomalies corresponding
to some non-stationarity in the signal). Domain experts have selected a set of
statistical tests as scores. For instance, the Mann-Whitney U test can be used to
reject the null hypothesis that two populations are identical. It can be applied
to a time series by selecting a potential break point $t_b$ in the series and a
window size $w$, and by considering the $w/2$ points before $t_b$ as the first
population and the $w/2$ points after $t_b$ as the second population. The score
is the p-value of the U test applied to those populations: a high value leads to
not rejecting the null hypothesis and is thus an indication that the time series
belongs to the “no anomaly” class. Notice that the parameters of the score are
here the potential break point $t_b$ and the window size $w$. See [7] for other
examples.
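The windowed U test score described above can be sketched in a few lines. This is only an illustration: the function names `mann_whitney_p_value` and `break_score` are hypothetical, and the p-value uses the normal approximation without tie correction of the variance, which is an assumption of this sketch rather than the procedure used in [7].

```python
import math

def mann_whitney_p_value(first, second):
    """Two-sided p-value of the U test (normal approximation, variance not tie-corrected)."""
    n1, n2 = len(first), len(second)
    # U counts the pairs (x, y) with x > y; a tie counts for one half.
    u = sum((x > y) + 0.5 * (x == y) for x in first for y in second)
    mean_u = n1 * n2 / 2.0
    std_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (u - mean_u) / std_u
    return math.erfc(abs(z) / math.sqrt(2.0))

def break_score(series, t_b, w):
    """Score with parameters (t_b, w): compare the w/2 points before and after t_b."""
    half = w // 2
    before = series[t_b - half:t_b]
    after = series[t_b:t_b + half]
    return mann_whitney_p_value(before, after)

# A clear level shift at t_b = 10 gives a low score (small p-value).
shifted = list(range(10)) + [100 + i for i in range(10)]
print(break_score(shifted, t_b=10, w=20))
```

A series whose two windows are well mixed yields a high score, which matches the reading of a high p-value as evidence for the “no anomaly” class.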

While the experts can design scores, they are seldom able to provide more 
than hints about the parameters and the thresholds (i.e. when to consider that 
the score is “high enough”). In addition, the scores are generally not sufficient
alone and several of them should be combined to achieve acceptable classification
rates. We proposed in [7] to address this problem via feature selection using the
mRMR filter approach. The main idea, recalled in Section 2, consists in
turning the scores into a large set of redundant binary features and in using a
feature selection method to keep useful ones. This has the effect of finding good 
parameters and thresholds for the scores while retaining a reasonable number 
of scores, easing interpretation of the decision made by a Naive Bayes Classifier 
(NBC) built on them. In [7], the selection is done by a filter method. We 
investigate in the present paper wrapper based approaches. 

2 From scores to binary indicators 

We assume that we are given a training set $(X_i, Y_i)_{1 \le i \le N}$, where
the observation space $\mathcal{X}$ can be arbitrary while the target space is a
finite set of classes, $\mathcal{Y} = \{1, \dots, K\}$. We are also given a set
of $Q$ parametric scores, $(S_q)_{1 \le q \le Q}$. Each $S_q$ is a function
from $\mathcal{X} \times \mathcal{W}_q$ to $\mathbb{R}$, where $\mathcal{W}_q$ is
the parameter space of the score.

The main constraint of our context is that experts only allow score results
to be used by the classifier. In addition, the semantics of the scores mean that
only decisions of the form $S_q(X_i, w_q) \le \lambda_q$ are really meaningful.
We propose to transform this set of scores into a much larger set of binary
indicators. This is done by choosing for each score a finite subset of
$\mathcal{W}_q$, $\{w_q^1, \dots, w_q^{p_q}\}$, and a finite set of thresholds
$\{\lambda_q^1, \dots, \lambda_q^{t_q}\}$, and by defining $p_q \times t_q$
indicator functions by $I_q^{j,l}(X) = \mathbb{1}_{S_q(X, w_q^j) \le \lambda_q^l}$.
This can be seen as a form of grid search in the “score space”, in the sense that
tuning the scores to the data set can be done indirectly by selecting relevant
binary indicators (a similar principle is used in e.g. [1]).
By feeding the training set through the indicators, we obtain a new training set
$(B_i, Y_i)_{1 \le i \le N}$, where the $B_i$ take values in $\{0,1\}^P$ and $P$
is the total number of binary indicators generated from the scores.
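The binarization step can be illustrated as follows. This is a minimal sketch: the function `binarize`, its arguments and the toy score are hypothetical names introduced for illustration, and the comparison is assumed to be non-strict.

```python
def binarize(x, scores, param_grids, threshold_grids):
    """Return the indicator vector of one observation x.

    scores[q] is a callable S_q(x, w); param_grids[q] and threshold_grids[q]
    are the finite grids chosen for that score.
    """
    indicators = []
    for score, params, thresholds in zip(scores, param_grids, threshold_grids):
        for w in params:
            value = score(x, w)  # one score evaluation per parameter value
            for lam in thresholds:
                # One indicator per (parameter, threshold) pair.
                indicators.append(1 if value <= lam else 0)
    return indicators

# Toy score: sum of the first w points of the observation.
toy_score = lambda x, w: sum(x[:w])
print(binarize([1, 2, 3], [toy_score], [[1, 2]], [[0, 1, 3]]))
```

Each score contributes as many indicators as there are (parameter, threshold) pairs, so the vector length grows with the product of the two grid sizes.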

Contrary to arbitrary binary valued variables, those indicators are
intrinsically highly redundant and correlated. For instance, if
$\lambda_q^l \le \lambda_q^{l'}$, then $I_q^{j,l}(X) = 1$ implies
$I_q^{j,l'}(X) = 1$. This has adverse consequences on feature selection methods
and on the Naive Bayes Classifier.

3 Naive Bayes Classifier 

The Naive Bayes Classifier (NBC, see e.g. [3]) is a very simple and robust
classifier based on the (unrealistic) assumption that the features used to describe
the objects to classify are conditionally independent given the class. In our
context, this translates into
$P(B = b \mid Y = k) = \prod_{p=1}^{P} P(B^p = b^p \mid Y = k)$, where
$B^p$ is the $p$-th indicator value in the indicator vector. This makes it easy
to estimate the posterior probability $P(Y = k \mid B = b)$ as

$$P(Y = k \mid B = b) = \frac{P(Y = k) \prod_{p=1}^{P} P(B^p = b^p \mid Y = k)}{P(B = b)}$$

using estimated values of the $P(B^p = b^p \mid Y = k)$. Those values are
obtained by simple class conditional counts.
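The class conditional counts and the resulting posterior can be sketched as below. The names `fit_nbc` and `posterior` are hypothetical, classes are assumed to be coded 0, ..., K-1, and the Laplace smoothing constant `alpha` is an assumption added here to avoid zero probabilities; it is not specified in the text.

```python
def fit_nbc(B, y, n_classes, alpha=1.0):
    """Estimate P(Y = k) and P(B^p = 1 | Y = k) by class conditional counts.

    alpha is a smoothing constant (an assumption of this sketch).
    """
    n_features = len(B[0])
    class_counts = [0] * n_classes
    ones = [[0] * n_features for _ in range(n_classes)]
    for b, k in zip(B, y):
        class_counts[k] += 1
        for p, bit in enumerate(b):
            ones[k][p] += bit
    priors = [c / len(y) for c in class_counts]
    cond = [[(ones[k][p] + alpha) / (class_counts[k] + 2 * alpha)
             for p in range(n_features)] for k in range(n_classes)]
    return priors, cond

def posterior(b, priors, cond):
    """P(Y = k | B = b) for every class k under conditional independence."""
    joint = []
    for k, prior in enumerate(priors):
        prob = prior
        for p, bit in enumerate(b):
            prob *= cond[k][p] if bit else 1.0 - cond[k][p]
        joint.append(prob)
    total = sum(joint)  # estimate of P(B = b)
    return [j / total for j in joint]

priors, cond = fit_nbc([[1, 0], [1, 1], [0, 0], [0, 1]], [0, 0, 1, 1], n_classes=2)
print(posterior([1, 0], priors, cond))
```

For long indicator vectors the product underflows quickly, so a practical implementation would work with sums of logarithms instead.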



In our context, the motivation for using the NBC is twofold. Its classical
properties (simplicity, robustness and good performance) are of course a first
motivation (it was used successfully in e.g. [1] with binary indicators). In
addition, the actual classification is performed in a way that is very easy to
interpret by a domain expert with limited machine learning expertise. It consists
indeed in comparing posterior probabilities, which can be done on an indicator
by indicator basis, by computing
$\frac{P(B^p = b^p \mid Y = k)}{P(B^p = b^p \mid Y = k')}$.

This makes it possible to show the user the indicators, and thus the underlying
scores, that are the most important in a given decision, in the sense that they
are the most discriminant between classes for a given observation (see [6] for a
complete visual solution). In our application context, a black box decision model
is unacceptable, while this kind of grey box decision is accepted by the domain
experts, as long as the NBC is constructed from their scores.
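The indicator by indicator comparison can be sketched as a ranking of log ratios between two classes. The function names are hypothetical, `cond[k][p]` is assumed to hold an estimate of $P(B^p = 1 \mid Y = k)$ strictly between 0 and 1, and ranking by signed log ratio is only one possible reading of “most discriminant”.

```python
import math

def indicator_contributions(b, cond, k, k_other):
    """log P(B^p = b^p | Y = k) - log P(B^p = b^p | Y = k_other) for each p."""
    contributions = []
    for p, bit in enumerate(b):
        pk = cond[k][p] if bit else 1.0 - cond[k][p]
        po = cond[k_other][p] if bit else 1.0 - cond[k_other][p]
        contributions.append(math.log(pk) - math.log(po))
    return contributions

def most_discriminant(b, cond, k, k_other, top=3):
    """Indices of the indicators that favour class k over k_other the most."""
    c = indicator_contributions(b, cond, k, k_other)
    return sorted(range(len(c)), key=lambda p: c[p], reverse=True)[:top]

cond = [[0.75, 0.5], [0.25, 0.5]]
print(most_discriminant([1, 0], cond, k=0, k_other=1, top=1))
```

Because the NBC decision is a sum of such terms plus the log prior ratio, the ranked list directly explains why one class was preferred for a given observation.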

4 Feature selection for the NBC 

Selecting good features is of utmost importance to obtain good performance with
an NBC. In addition, the type of decision analysis we mentioned in the previous
section is only realistic if the number of features remains relatively small.

Numerous feature selection methods have been investigated for NBC, ranging 
from wrapper forward search [3] to mRMR like filter approach as in mm, but 
more sophisticated search strategies (such as forward backward methods, see
[2], chapter 4) have not. In addition, our specific context of highly redundant
binary indicators also remains unexplored. It should finally be noted that the
solution recommended in textbooks remains a basic mutual information based filter
approach (see e.g. HI)- 

4.1 Incremental calculation 

The main motivation of filter approaches is generally the large computational
cost of wrapper solutions, even though the latter tend to give better feature
subsets than the former. Fortunately, the NBC structure allows one to implement
forward or backward strategies in a rather efficient way. Indeed, the decision of
an NBC is made by comparing posterior probabilities, which can be done
equivalently by comparing the log likelihoods of the pair $(B, k)$ for the
different classes $k$:

$$\log P(B = b, Y = k) = \log P(Y = k) + \sum_{p=1}^{P} \log P(B^p = b^p \mid Y = k).$$

Given a feature subset of size $m \le P$, this can be computed in $O(Nm)$ for all
the $(B_i, Y_i)$, provided the full conditional distributions
$P(B^p \mid Y = k)$ have already been computed (this is done in $O(NP)$ if $K$
is small compared to $P$). The effect of adding or removing a feature can then be
computed by simply adding the contribution of the feature to, or subtracting it
from, $\log P(B = b, Y = k)$, for a total time in $O(N)$. Evaluating all the
features in a forward or backward step is therefore in $O(NP)$, and the total
cost of a forward (or backward) search is in $O(NP^2)$. Notice that this does not
apply to arbitrary search strategies where one can move from a feature subset to
a completely different one (such as in genetic algorithms).
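The incremental update can be sketched as a greedy forward search that keeps, for every observation and class, the current log joint value and only adds the terms of the candidate feature. This is an illustrative sketch guided by the classification error: the function name and the layout `log_cond[k][p][v]` (log of $P(B^p = v \mid Y = k)$) are assumptions, and classes are assumed to be coded 0, ..., K-1.

```python
def forward_search(B, y, log_prior, log_cond, n_steps):
    """Greedy forward selection guided by the classification error.

    Returns the selected feature indices, in selection order.
    """
    n, K = len(B), len(log_prior)
    # scores[i][k] = log P(B_S = b_i, Y = k) for the current subset S.
    scores = [list(log_prior) for _ in range(n)]
    remaining = set(range(len(B[0])))
    selected = []

    def errors_if_added(p):
        # O(N) for a fixed number of classes: only the terms of feature p are added.
        count = 0
        for i in range(n):
            v = B[i][p]
            pred = max(range(K), key=lambda k: scores[i][k] + log_cond[k][p][v])
            count += pred != y[i]
        return count

    for _ in range(min(n_steps, len(remaining))):
        # Ties are broken in favour of the smallest feature index.
        best = min(sorted(remaining), key=errors_if_added)
        for i in range(n):  # commit the update in place
            for k in range(K):
                scores[i][k] += log_cond[k][best][B[i][best]]
        selected.append(best)
        remaining.remove(best)
    return selected
```

A backward search follows the same pattern, starting from the scores of the full feature set and subtracting the terms of the candidate feature instead of adding them.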

This is still an order of magnitude more expensive than e.g. a simple Mutual
Information (MI) based forward search, which costs $O(NP)$, but it is comparable
to mRMR, which also costs $O(NP^2)$ when used to rank all the features.
Compared to classifiers for which no incremental solution exists, forward or
backward wrapper based approaches are thus more affordable for the NBC. Indeed,
non incremental classifiers generally have a training cost of at least $O(Nm)$
for $m$ features, leading to an $O(NP^3)$ total cost.
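These orders of magnitude can be checked with a short count. As a sketch, assume that step $m$ of a forward search evaluates the $P - m + 1$ remaining candidates, each evaluation costing $N$ operations with the incremental update and $Nm$ operations when the classifier is retrained from scratch (constants and the number of classes are ignored):

```latex
\sum_{m=1}^{P} (P - m + 1)\, N = N\,\frac{P(P+1)}{2} = O(NP^2),
\qquad
\sum_{m=1}^{P} (P - m + 1)\, N m = N\,\frac{P(P+1)(P+2)}{6} = O(NP^3).
```

Under these assumptions the ratio of the two sums is $(P+2)/3$, that is 272 for the $P = 814$ indicators obtained in Section 5.1.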

4.2 Other NBC specific aspects 

In addition to the reasonable complexity of its wrapper solutions, the NBC has
some specific aspects that should be taken into account during feature selection.
Firstly, the NBC is non monotonic, as adding features can degrade its
performance (a lot), mostly because it cannot weight features. Thus
branch-and-bound solutions cannot be used reliably.

A second issue is the evaluation metric to be used during the search. In
wrapper approaches, one uses in general the risk under consideration, that is the
classification error in our case. However, because of the additive nature of the
NBC, many features have either no effect or an identical one when they are added
one by one to an existing set of features. In other words, the classification error
is not sensitive enough to distinguish between some of the features. We propose
therefore to use the error probability as estimated by the NBC itself as the quality
measure during feature selection. For a feature subset $S$, this is
$\sum_{i=1}^{N} P_S(Y \neq Y_i \mid B = B_i)$, where
$P_S(Y \neq Y_i \mid B = B_i)$ is the conditional probability estimated with
the features from $S$ by the NBC. Using this value amounts to taking into account
the uncertainty in the decision as estimated by the NBC itself. It should be noted
that the log conditional likelihood
$\sum_{i=1}^{N} \log P_S(Y = Y_i \mid B = B_i)$ gives very poor
results in this NBC context because of the conditional independence assumption
(results are not reported here for space reasons).
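The difference between the two quality measures can be illustrated on toy posteriors. In this sketch the function names are hypothetical, `posteriors[i][k]` is assumed to hold the NBC estimate of the probability of class k for observation i, and both measures are averaged over the observations, which does not change the ranking of feature subsets.

```python
def error_probability(posteriors, y):
    """Mean of 1 - P_S(Y = Y_i | B = B_i) over the evaluation set."""
    return sum(1.0 - post[k] for post, k in zip(posteriors, y)) / len(y)

def classification_error(posteriors, y):
    """Fraction of observations whose most probable class is not Y_i."""
    wrong = sum(max(range(len(post)), key=post.__getitem__) != k
                for post, k in zip(posteriors, y))
    return wrong / len(y)

y = [0, 1]
hesitant = [[0.6, 0.4], [0.4, 0.6]]   # correct decisions, low confidence
confident = [[0.9, 0.1], [0.2, 0.8]]  # correct decisions, high confidence
print(classification_error(hesitant, y), classification_error(confident, y))
print(error_probability(hesitant, y), error_probability(confident, y))
```

Both toy subsets make no classification error, so the error cannot rank them, while the error probability prefers the more confident one.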

5 Experimental evaluation 

We compare in this section several feature selection schemes for the NBC used
on the binary indicators obtained as explained in Section 2.

5.1 Data sets 

We use simulated data sets similar to those used in [7]. The training set and the
test set have identical characteristics: each is made of 6000 time series, with
3000 normal examples (Gaussian white noise with $\sigma = 1$ standard deviation)
and 3000 abnormal examples belonging to three different classes (in equal
proportions). The mean change anomaly consists in switching from a $\mu = 0$ mean
white noise to a $\mu \in [1,5]$ white noise. The variance change anomaly
consists in switching from a $\sigma = 1$ standard deviation white noise to a
$\sigma \in [2,6]$ white noise. Finally, the trend shift anomaly adds a linear
trend to the signal from the change point to the end of the time series, with a
final trend amplitude in [1,5]. Signal lengths are chosen uniformly at random in
[100, 200] time steps, while the change point happens in the 60 % central area of
the signal (e.g. in [20,80] for a length 100 signal).
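A generator for such series could look as follows. This is a sketch under assumptions that the text does not state: the anomaly amplitudes and the change point are drawn uniformly in their intervals, and the function name `simulate_series` and the class labels are hypothetical.

```python
import random

def simulate_series(kind, rng):
    """One series; kind is 'normal', 'mean', 'variance' or 'trend'."""
    n = rng.randint(100, 200)               # length in [100, 200]
    t_b = rng.randint(n // 5, 4 * n // 5)   # change point in the central 60 %
    series = [rng.gauss(0.0, 1.0) for _ in range(n)]
    if kind == "mean":
        mu = rng.uniform(1.0, 5.0)
        for t in range(t_b, n):
            series[t] += mu
    elif kind == "variance":
        sigma = rng.uniform(2.0, 6.0)
        for t in range(t_b, n):
            series[t] *= sigma
    elif kind == "trend":
        amplitude = rng.uniform(1.0, 5.0)
        for t in range(t_b, n):
            # Linear ramp reaching the final amplitude at the last time step.
            series[t] += amplitude * (t - t_b) / (n - 1 - t_b)
    return series, t_b

rng = random.Random(0)
series, t_b = simulate_series("trend", rng)
print(len(series), t_b)
```

For a normal series the returned change point is simply unused.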

The scores are based on sliding windows on which two population tests are
conducted (as explained in Section 1). We use the U test, the Kolmogorov-
Smirnov test and the F-test (variance test). The parameter is in all cases the
window length. We also use confirmatory scores based on successive windows.
Details on the parameter values and on the confirmation scores can be found in
[7]. After the binarization process described in Section 2 has been applied, we
obtain in this context 814 indicators.

5.2 General procedure and evaluation 

For each feature selection method, the NBC is built on half of the training set 
(keeping class proportions) and the best feature subset is selected using the 
second half of the training set (by choosing the smallest subset among those 
that have the lowest classification error). The feature subset is then evaluated 
on the test set by reporting the classification error. 

5.3 Feature selection techniques 

We use two filter procedures as reference, namely a simple Mutual Information 
(MI) feature ranking, and the mRMR ranking l^. All the other methods are 
wrapper approaches using either the classification error or the error probability 
as performance measure. We compare a forward search (at each step, the best 
feature is added to the feature set), a backward search (at each step, the worst 
feature is removed from the feature set) and full forward/backward search (also
called floating search in [2]). In those algorithms, a forward phase is followed by
a backward phase (and vice versa) until the results do not improve. For instance, 
one starts by a backward search to find a first optimal subset, then proceeds to 
a forward search from this subset to get a better one (with more variables). In 
case of improvement, the procedure is restarted from the last subset (backward, 
then forward, etc.). 
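The alternation of forward and backward phases can be sketched independently of the classifier. This is a simplified illustration: the names `greedy_pass` and `floating_search` are hypothetical, `cost` stands for any subset quality measure (classification error or error probability on the validation half), and each phase stops at the first move that does not strictly improve the cost, which may differ from the exact stopping rule used in the experiments.

```python
def greedy_pass(subset, all_features, cost, forward):
    """Repeat single-feature additions (or removals) while the cost strictly decreases."""
    subset = set(subset)
    best_cost = cost(subset)
    while True:
        moves = (set(all_features) - subset) if forward else set(subset)
        # subset ^ {f} adds f in a forward pass and removes it in a backward pass.
        candidates = [(cost(subset ^ {f}), f) for f in sorted(moves)]
        if not candidates:
            return subset, best_cost
        new_cost, f = min(candidates)
        if new_cost >= best_cost:
            return subset, best_cost
        subset ^= {f}
        best_cost = new_cost

def floating_search(all_features, cost, start_forward=True):
    """Alternate forward and backward passes until neither improves the cost."""
    all_features = set(all_features)
    subset = set() if start_forward else set(all_features)
    forward = start_forward
    best_cost = cost(subset)
    stalled = 0
    while stalled < 2:
        new_subset, new_cost = greedy_pass(subset, all_features, cost, forward)
        stalled = 0 if new_cost < best_cost else stalled + 1
        subset, best_cost = new_subset, new_cost
        forward = not forward
    return subset, best_cost

# Toy cost: number of features that differ from a hypothetical ideal subset.
toy_cost = lambda s: len(s ^ {1, 3})
print(floating_search({0, 1, 2, 3}, toy_cost, start_forward=False))
```

With `start_forward=False` the search corresponds to the backward-forward variant, and with `start_forward=True` to the forward-backward one.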

6 Results and discussion 

Results are summarized in Table 1. As expected, the wrapper approaches
outperform the filter ones. In addition, the high redundancy of the binary
indicators, implied by their construction, has strong adverse effects on the MI
filter method as it tends to select very redundant indicators. The mRMR ranking
avoids this effect but obtains suboptimal results.


Method              Perf. measure   # of features   Test error
--------------------------------------------------------------
MI filter           Error                     422       0.1387
mRMR filter         Error                      19       0.1435
Forward search      Error                     136       0.1237
Forward search      Probability               207       0.1225
Backward search     Error                      27       0.1308
Backward search     Probability                86       0.1283
Forward-Backward    Error                      92       0.1238
Forward-Backward    Probability               123       0.1237
Backward-Forward    Error                     112       0.1267
Backward-Forward    Probability               122       0.1168


Table 1: Classification error obtained on the test set 


Using the error probability rather than the classification error always
improves the results of the wrapper approaches. It allows a more accurate ordering
of the features, which cannot be inferred by the search procedure alone.

Moving from a simple greedy search to a floating search improves the
performance in the case of the backward search, as the simple backward search
tends to select too few variables. In the case of the forward search, the
performance is slightly degraded but the number of features is strongly reduced.

Those results show that the filter approaches should be avoided for the NBC.
They also show that the error probability should be used to guide the greedy
search in order to get a more accurate ordering of the features at each step
of the search. The greedy search wrapper procedures give rather comparable
results, with slightly better performance for the floating search. In the highly
redundant binary indicators context, the backward floating search guided by
the error probability appears as the best solution, contrary to the classical
recommendations for the NBC (namely, using a MI filter).